Abstract

This paper introduces a procedure based on genetic programming to
evolve XSLT programs (usually called stylesheets or logicsheets). XSLT is
a general-purpose, document-oriented functional language, generally used
to transform XML documents (or, in general, to solve any problem that can
be coded as an XML document). The proposed solution uses a tree
representation for the stylesheets as well as several specific operators in order
to obtain, in the studied cases and in a reasonable time, an XSLT stylesheet
that performs the transformation. Several types of representation have
been compared, resulting in different performance and degrees of success.

Keywords: genetic programming, XML, XSLT, JEO, DREAM, con- 
strained evolutionary computation, document transformation 



1 Introduction 

XML (eXtensible Markup Language [10]) encompasses a set of specifications
with different semantics but a common syntactic structure; XML documents
must have a single root element and paired tags, with attributes, which
can be nested. Thus, all XML documents have a tree structure (the so-called
Document Object Model, or DOM, tree) with a single root element that
contains (encapsulates) all the contents of the document. Optionally, the syntax or
semantics of elements and attributes may be determined by a Document Type
Definition (DTD) or XSchema (an equivalent concept that uses XML for its
definition [9]), in which case the document can be validated; however, in most
applications what is called well-formed XML is more than enough.

Since the IT industry has settled on different XML dialects as information
exchange formats, there is a business need for programs that transform from one

* Supported by projects TIN2007-68083-C02-01 and P06-TIC-02025.
<?xml version="1.0"?>
<html>
  <head>
    <title>Test page</title>
  </head>
  <body>
    <h1>Test page</h1>
    <h2>First test</h2>
    <p>Some stuff <br />
    Some more stuff </p>
    <h2>Second test</h2>
    <h2>That's another test</h2>
  </body>
</html>

Figure 1: An example simplified XHTML document. Looks like HTML, but it 
has an XML syntax: mainly, tags must be strictly paired. 

XML set of tags to another, extracting information or combining it in many
possible ways; a typical example of this transformation could be the extraction
of news headlines from an online newspaper that uses XHTML (an XML
version of the Hypertext Markup Language (HTML) used in web pages; see
figure 1).

XSLT stylesheets (XML Stylesheet Language for Transformations) [7], also
called logicsheets, are designed for this purpose: applied to an XML document,
they produce another. There are other possible solutions: programs written in
any language that work with text as input and output, programs using regular
expressions, or SAX filters [18], which process each tag in an XML document in a
different way and do not need to load the whole XML document into memory.
However, they need external languages to work, while XSLT is a part of the
XML set of standards and, in fact, XSLT logicsheets are XML documents,
which can be integrated within an XML framework; that is why XSLT is, if not
the most common, at least a quite usual way of transforming XML documents.

The amount of work needed for logicsheet creation is a problem that scales
quadratically with the number of initial and final formats. For n input and
m output formats, n × m transformations will be needed.¹ Considering that
each conversion is a hand-written program and that the initial and final formats
can change fairly frequently, any automation of the process means a considerable
saving of effort on the part of the programmers.
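
As a quick illustration of the scaling argument, the following sketch counts the hand-written stylesheets needed with direct conversions and with a single intermediate language (the alternative mentioned in footnote 1). The function names are ours and purely illustrative.

```python
def direct_transformations(n_inputs: int, n_outputs: int) -> int:
    """One hand-written stylesheet per (input format, output format) pair."""
    return n_inputs * n_outputs


def pivot_transformations(n_inputs: int, n_outputs: int) -> int:
    """One stylesheet into an intermediate language per input format,
    plus one out of it per output format."""
    return n_inputs + n_outputs


if __name__ == "__main__":
    for n, m in [(2, 2), (5, 4), (10, 10)]:
        print(n, m, direct_transformations(n, m), pivot_transformations(n, m))
```

With 5 input and 4 output formats, for example, 20 direct stylesheets are needed, against 9 when an intermediate language is used.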

The objective of this work is to find the XSLT logicsheet that, from one
input XML document, is able to obtain an output XML document that contains
exclusively the information that is considered important from the original XML
1 If an intermediate language is used, just n + m transformations are needed, but
this increases the complexity of the transformation and decreases its speed.
documents. This information may be ordered in any possible way, possibly in
an order different from that of the input document. This logicsheet will be
evolved using evolutionary operators that take into account the structure of the
program and its components. This could be considered, in a way, Genetic
Programming, since XSLT logicsheets are XML documents that have a tree
structure; but, since they have to follow grammatical conventions, it is better to
guide evolution using specific operators than to allow all types of GP operators.

Thus, XSLT provides a general mechanism for associating patterns
in the source XML document with the application of format rules to the matching
elements. In order to simplify the search space for the evolutionary algorithm,
however, only three XSLT instructions will be used in this work: template, which
sets which XML fragment will be included when the element in its match attribute
is found; apply-templates, which is used to select the elements to which the
transformation is going to be applied and to delegate control to the corresponding
templates; and value-of², which simply includes the content of an XML
document into the output file. This also implies a simplification of the general
XML-to-XML transformation problem: we will just extract information from
the original document, without adding new elements (tags) that did not exist
in the original document. In fact, this makes the problem more similar to the
creation of a scraper, that is, a program that extracts information from legacy
websites or documents. Thus, we intend this paper just as a proof of concept and
initial performance measurement, whose generalization, if not straightforward,
is at least possible.

XSLT stylesheets combine XSLT commands with embedded XPath [8] expressions
to map XML documents into others. For instance, to extract all h2
elements in the XHTML example shown in figure 1, both XSLT logicsheets shown
in figures 2 and 3 would be valid, but the second one is simpler, making use of a
single XPath expression, while the other one would obtain the same result using
only XSLT templates. In addition, XPath provides a way to select groups of
elements (node-sets) and to filter them by using predicates, allowing, for instance,
the selection of the element that occupies a certain position within a node-set.
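
The kind of XPath selection discussed here can be tried out with a minimal sketch in Python. It uses the standard xml.etree.ElementTree module, which implements only a subset of XPath (not the full language an XSLT processor supports), and it is not part of the system described in this paper. The document is the one in figure 1; the variable names are ours.

```python
import xml.etree.ElementTree as ET

XHTML = """<html>
  <head><title>Test page</title></head>
  <body>
    <h1>Test page</h1>
    <h2>First test</h2>
    <p>Some stuff <br /> Some more stuff </p>
    <h2>Second test</h2>
    <h2>That's another test</h2>
  </body>
</html>"""

root = ET.fromstring(XHTML)  # root is the <html> element

# Equivalent of the absolute path /html/body/h2, written relative to <html>.
all_h2 = [element.text for element in root.findall("./body/h2")]

# Positional predicate: the second h2 inside body (positions start at 1).
second_h2 = root.find("./body/h2[2]").text

# Descendant axis: any h2 below the root, whatever the intermediate tags are.
any_h2 = [element.text for element in root.findall(".//h2")]

print(all_h2)
print(second_h2)
```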

Previously, we published the initial XSLT evolution experiments [19], testing
different document structures and operators. In this paper we will try to
improve on those results, choosing the XSLT stylesheet structure and operators so
that convergence to a solution is assured. We will also try to examine the influence
of the different operator rates on the result.

The rest of the paper is structured as follows: the state of the art is presented
in section 2. Section 3 describes the solution presented in this work.
Experiments are described in section 4, with the automatic generation of XSLT
stylesheets for two examples, and finally the conclusions and possible lines of
future work are presented in section 5.

2 With text used for easy visualization of the final document 
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent='yes'/>
  <xsl:template match="/">
    <output>
      <xsl:apply-templates select='html'/>
    </output>
  </xsl:template>
  <xsl:template match='html'>
    <xsl:apply-templates select='body'/>
  </xsl:template>
  <xsl:template match='body'>
    <xsl:apply-templates select='h2'/>
  </xsl:template>
  <xsl:template match='h2'>
    <line><xsl:value-of select='.'/></line>
  </xsl:template>
</xsl:stylesheet>

Figure 2: Example XSLT logicsheet that extracts the content of h2 tags to an
XML file; each content will be contained in line tags. This example shows a
structure with templates for all elements in the path that leads to the element
being extracted.

2 State of the art 

So far, very few papers about applying genetic programming techniques to the
automatic generation of XSLT logicsheets have been published; one of these, by
Scott Martens [13], presents a technique to find XSLT stylesheets that transform
an XML file into HTML by using genetic programming. Martens works on
simple XML documents, like the ones shown in his article, and uses the UNIX
diff utility as the basis for his fitness function. He concludes that genetic
programming is useful to obtain solutions to simple examples of the problem,
but that it needs unreasonable execution times for complex examples and might
not be a suitable method to solve this kind of problem. However, computing has
changed a lot in the last seven years, and the time for doing it is probably
now, as we attempt to prove in this paper.

Unaware of this effort, and coming from a completely different field, Schmidt 
and Waltermann [15] approached the problem taking into account that XSLT 
is a functional language, and using functional language program generation 
techniques on it, in what they call inductive synthesis. First they create a 
non-recursive program, and then, by identifying recurrent parts, convert it into 
a recursive program; this is a generalization of the technique used to gener- 
ate programs in other programming languages such as LISP [H [H] , and used 
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent='yes'/>
  <xsl:template match="/">
    <output>
      <xsl:apply-templates select='/html/body/h2'/>
    </output>
  </xsl:template>
  <xsl:template match='h2'>
    <line><xsl:value-of select='.'/></line>
  </xsl:template>
</xsl:stylesheet>

Figure 3: Another example XSLT logicsheet for extracting h2 tags, in this case
using an XPath expression to process just the needed nodes.

thoroughly since the eighties [5]. 

A few other authors have approached the general problem of generating 
XML document transformations knowing the original and target structure of 
the documents, as represented by its DTD: Leinonen et al. [HI [TT] have pro- 
posed semi-automatic generation of transformations for XML documents; user 
input is needed to define the label association. There are also freeware programs
that perform transformations on documents from one XSchema to another.
However, they must know both XSchemata in advance, and are not able
to accomplish general transformations on well-formed XML documents from
examples.

The automatic generation of XSLT logicsheets is also a superset of the problem
of generating wrappers, that is, programs that extract information from
websites, such as the one described by Ben Miled et al. in [3]. In fact, HTML is
similar in structure to XML (and can actually be XML in the shape of XHTML),
but these programs do not generate new data (new tags); they only extract
information already existing in web sites. This is what applications such as X-Fetch
Wrapper, developed by Republica³, do. The company that marketed it claims
that it is able to perform transformations between any two XML formats from
examples. However, it is not so clear that transformations are that
straightforward: according to a white paper found at their website, it uses a
document transformation language.

3 This company no longer exists, and the product seems to have been discontinued 
3 Methodology



XSLT stylesheets are represented as tree structures and evolved using
variation operators. Every XSLT stylesheet is evaluated using a fitness function
that is related to the difference between the generated XML and the output XML
associated with the example. The solution has been programmed using JEO [5], an
evolutionary algorithm library developed at the University of Granada as part of
the DREAM project, which is available at http://www.dr-ea-m.org together
with the rest of the project. All source code for the programs used to run the
experiments is available from
https://forja.rediris.es/websvn/wsvn/geneura/GeneradorXSLT/
under an open source licence.

The generated XML documents are encapsulated within an XML tag whose 
name equals the root element from the input XML; each line uses also the 
tag line, so that we can distinguish easily between intended and unintended 
(generated by default templates, for instance) output lines. Next, structures 
used for evolution and operators applied to them are described. These operators 
work on data structures and XPath queries within them. 

The search space over possible stylesheets is exceedingly large. In addition,
the language grammar must be considered in order to avoid generating
syntactically wrong stylesheets. For these reasons, transformations are applied to
predetermined stylesheet structures, which are described next
along with the operators that are applied to them.

3.1 Type 1 structure 

<xsl:template match="/">
  <xsl:apply-templates select="/book"/>
</xsl:template>
<xsl:template match="book">
  <xsl:apply-templates select="chapter[2]"/>
  <xsl:apply-templates select="chapter[3]/para[5]"/>
  <xsl:apply-templates select="chapter[2]//line"/>
</xsl:template>
<xsl:template match="title">
  <line><xsl:value-of select="."/></line>
</xsl:template>

Figure 4: Example of XSLT stylesheet of type 1.

An example of this structure is shown in figure 4. Its main features are:

• The XSLT logicsheet will have three levels of depth. The first level is the root
element <xsl:stylesheet>, which is common to all XSLT stylesheets.

• An undetermined number of <xsl:template match=...> instructions hang
from the root element.
• The value of the match attribute for the first template that hangs off the
root will be "/". This template and its content will never be modified
by applying evolution operators. The only instruction inside this element
will be apply-templates, which will have a select attribute whose value will
be a "/" followed by the root element name, so that the rest of the templates
included in the stylesheet will be processed.

• The values of the match attributes for the rest of the templates will simply
be tag names of the input XML. Every template will have an undetermined
number of children, which will be apply-templates or value-of instructions.
These instructions will have select attributes, whose values will be relative
XPaths, built over the template path. These routes may include any
XPath clause; value-of will be used instead of apply-templates
when the XPath is self (.).

This kind of structure is quite unconstrained, and relies heavily on the use
of default templates. If an element is not matched, the default template, which
includes the text inside the element, is applied. For the example shown in figure
4, default templates will be used for the para and chapter elements, for instance.

3.2 Type 2 structure 

<xsl:template match="/">
  <xsl:apply-templates select="/book"/>
  <xsl:apply-templates select="/book/title"/>
</xsl:template>

<xsl:template match="/book">
  <line><xsl:value-of select="chapter[2]"/></line>
  <line><xsl:value-of select="chapter[3]/para[5]"/></line>
  <line><xsl:value-of select="chapter[2]//line"/></line>
</xsl:template>

<xsl:template match="/book/title">
  <line><xsl:value-of select="."/></line>
</xsl:template>



Figure 5: Example of XSLT stylesheet of type 2. 

An example of this structure is shown in figure 5. The main differences with
the first one are:

• The value of the match attribute for the first template that hangs off the 
root will be "/" too, but, in this case it will have an indeterminate number 
of children, that will be all apply-templates instructions, whose values for 
the select attribute will be absolute XPaths in the input XML, that will 
include only single slash-separated tag names. 
• The values for the match attributes for the other templates that hang from
the XML root will be the same values that the select attributes of the
apply-templates in the first template had. Therefore, there will be as many
template instructions as apply-templates in it, and they will
be in the same order.

• Every one of these templates will have an undetermined number
of children, all of them value-of instructions, where the
value of the select attribute will be an XPath route relative to the
absolute XPath route of the parent template. These routes may include any
XPath mechanism that the designed operators allow.

• If the absolute route of a template has the maximum depth level inside the
XML structure, its only value-of child will select the self element (.).

This type of structure is more heavily constrained than Type 1; search is thus
easier, since fewer stylesheets are generated. Being more constrained, however,
mutation and crossover are much more disruptive, and the fitness landscape is
rougher than before.
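
The Type 2 constraints can be made concrete with a minimal sketch that builds such a stylesheet from a list of (absolute XPath, relative selects) pairs. This is an illustration written in Python with the standard xml.etree.ElementTree module, not the JEO-based implementation used in the experiments; the helper names xsl and build_type2 and the genome layout are assumptions of ours.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"
ET.register_namespace("xsl", XSL_NS)


def xsl(tag: str) -> str:
    """Qualified name of an XSLT element in ElementTree's {namespace}tag form."""
    return f"{{{XSL_NS}}}{tag}"


def build_type2(templates: list[tuple[str, list[str]]]) -> ET.Element:
    """Build a Type 2 stylesheet from (absolute XPath, relative selects) pairs."""
    sheet = ET.Element(xsl("stylesheet"), {"version": "1.0"})
    root_template = ET.SubElement(sheet, xsl("template"), {"match": "/"})
    # First template: one apply-templates per absolute path, in order.
    for path, _ in templates:
        ET.SubElement(root_template, xsl("apply-templates"), {"select": path})
    # One template per path, in the same order, holding only value-of children.
    for path, selects in templates:
        template = ET.SubElement(sheet, xsl("template"), {"match": path})
        for select in selects:
            line = ET.SubElement(template, "line")
            ET.SubElement(line, xsl("value-of"), {"select": select})
    return sheet


# Hypothetical genome reproducing the stylesheet of figure 5.
genome = [
    ("/book", ["chapter[2]", "chapter[3]/para[5]", "chapter[2]//line"]),
    ("/book/title", ["."]),
]
stylesheet = build_type2(genome)
print(ET.tostring(stylesheet, encoding="unicode"))
```

Because the first template is derived from the same list as the remaining ones, the select and match attributes cannot drift apart, which is the property that distinguishes Type 2 from Type 1.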

3.3 Genetic operators 

The operators may be classified into two types. The first consists of
operators that are common to the two structures and whose task is to
modify the XPath routes contained in the attributes of the XSLT instructions
(especially apply-templates and value-of). Operators in the second group are used
to modify the XSLT tree structure and take a different shape for each structure (so
that the structure is kept). In order to ensure the existence of the elements
(tags) added to the XPath expressions and XSLT instruction attributes, every
time one of them is needed it is randomly selected from the input file.
The common operators are: 

• XSLTTreeMutatorXPath(Add|Mutate|Remove)Filter: Adds, changes the number of,
or removes a cardinal (positional) filter on any of the XPath tags that allow it.
For example:

/book/chapter → /book/chapter[4]
/book/chapter[2] → /book/chapter[4]
/book/chapter[2] → /book/chapter

• XSLTTreeMutatorXPathAddBranch: Adds a new tag to an XPath, chosen
randomly from the existing XPaths, observing the hierarchy of the input
XML file tree: /book/chapter → /book/chapter/title

• XSLTTreeMutatorXPathSetSelf: Replaces the deepest node tag of an XPath
route by the self node.
• XSLTTreeMutatorXPathSetDescendant: Removes one of the intermediate
tags from an XPath route, leaving a descendant-type node:
/book/chapter/title → /book//title.

• XSLTTreeMutatorXPathRemoveBranch: Removes the deepest element tag
of an XPath route, ascending a level in the XML tree. For example:
/book/chapter/title → /book/chapter.
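
The effect of these XPath operators can be sketched as plain string manipulation. The functions below are illustrative only: their names and signatures are ours, they take the position to change as an explicit argument (the operators above choose it at random), they handle only simple slash-separated absolute paths, and they do not reproduce the JEO implementation.

```python
import re

_FILTER = re.compile(r"\[\d+\]$")  # trailing positional filter such as [2]


def _steps(xpath: str) -> list[str]:
    """Split an absolute path like /book/chapter[2]; '' marks a // gap."""
    return xpath.split("/")[1:]


def _join(steps: list[str]) -> str:
    return "/" + "/".join(steps)


def add_filter(xpath: str, index: int, position: int) -> str:
    steps = _steps(xpath)
    if not steps[index] or _FILTER.search(steps[index]):
        raise ValueError("step cannot take a new filter")
    steps[index] += f"[{position}]"
    return _join(steps)


def mutate_filter(xpath: str, index: int, position: int) -> str:
    steps = _steps(xpath)
    if not _FILTER.search(steps[index]):
        raise ValueError("step has no filter to change")
    steps[index] = _FILTER.sub(f"[{position}]", steps[index])
    return _join(steps)


def remove_filter(xpath: str, index: int) -> str:
    steps = _steps(xpath)
    steps[index] = _FILTER.sub("", steps[index])
    return _join(steps)


def add_branch(xpath: str, child_tag: str) -> str:
    # The caller must pick child_tag among the real children in the input XML.
    return f"{xpath}/{child_tag}"


def set_descendant(xpath: str, index: int) -> str:
    steps = _steps(xpath)
    if not 0 < index < len(steps) - 1:
        raise ValueError("only intermediate steps can be dropped")
    steps[index] = ""  # an empty step renders as the // descendant axis
    return _join(steps)


def remove_branch(xpath: str) -> str:
    steps = _steps(xpath)[:-1]
    while steps and steps[-1] == "":  # do not leave a dangling //
        steps.pop()
    if not steps:
        raise ValueError("cannot go above the root element")
    return _join(steps)
```

Each function reproduces one of the examples given in the list: for instance, set_descendant("/book/chapter/title", 1) yields /book//title.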

Other operators change the DOM structure of the XSLT logicsheet, although
not all of them can be applied to all XSLT structural types:

• XSLTTreeCrossoverTemplate: Swaps template instructions sub-trees be- 
tween the two parents. This is the only crossover-like operator. 

• XSLTTreeMutator(Add|Mutate|Remove)Template: Inserts, changes or removes
a template. Insertion is performed on the root element, matching a
random element. The choice of this random element gives higher priority
to shallower tags. The position of the new template inside the tree
is randomly selected, and its content will be apply-templates or
value-of tags whose select attribute contains XPath routes, relative
to the XPath route of the parent template, randomly generated using the XPath
operators. Change operates on a random node, generating a new sub-tree;
and removal eliminates a random template (if there are more than
two).

• XSLTTree(Add|Remove)Apply: Adds or removes an xsl:value-of statement
in a randomly selected template present in the tree. The position of
the new leaf inside the sub-tree that matches the template is also
randomly selected. The new element is randomly generated from the route
contained in its parent template instruction. The Remove operator also
deletes the template node if the removed child was the last remaining one,
but it is not applied if there is a single template left.

• XSLTTreeMutateApply(1|2): Changes a randomly selected child (1) or creates
a relative XPath from the one contained in the parent xsl:template
and the XPath of the leaf that is going to be modified (2).

• XSLTTreeSetTemplateNull: Chooses a template sub-tree from the XSLT
tree and replaces its content by a single instruction <xsl:value-of select="."/>.
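
As an illustration of the crossover operator, the sketch below swaps one template between two parents. It assumes a simplified representation of our own, in which an individual is the list of its non-root templates and each template is a (match, selects) pair; the swap positions are passed explicitly, whereas the actual operator would choose them at random.

```python
from copy import deepcopy

# (match attribute, select attributes of the children); hypothetical layout.
Template = tuple[str, list[str]]


def crossover_templates(
    parent_a: list[Template],
    parent_b: list[Template],
    index_a: int,
    index_b: int,
) -> tuple[list[Template], list[Template]]:
    """Return two children with one template sub-tree exchanged."""
    child_a, child_b = deepcopy(parent_a), deepcopy(parent_b)
    child_a[index_a], child_b[index_b] = child_b[index_b], child_a[index_a]
    return child_a, child_b


parent_a = [("/book", ["chapter[2]"]), ("/book/title", ["."])]
parent_b = [("/book/chapter", ["para[5]", "title"])]
child_a, child_b = crossover_templates(parent_a, parent_b, 0, 0)
print(child_a)
print(child_b)
```

The parents are copied before the swap, so they can still take part in other variation steps unchanged.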

3.4 Fitness function 

Fitness is related to the difference between the desired and the obtained output,
but it has also been designed so that evolution is helped. Instead of using a single
aggregative function, as we did in previous papers [19], fitness is now a vector
that includes the number of deletions and additions needed to obtain the target
output from the obtained output, and the length of the resulting XSLT stylesheet.
The XSLT stylesheet is correct only if the number of deletions and additions is
0; minimizing length helps remove useless statements from it. So, fitness
is minimized by comparing individuals as follows. An individual is considered
better than another

• if the number of deletions is smaller,

• if the number of additions is smaller, the number of deletions being the
same, or

• if the length is smaller, the number of deletions and additions being the same.

Separating and prioritizing the number of deletions helps guide evolution, by
trying to find first a stylesheet that includes all elements in the target document,
then eliminating unneeded elements, while, at the same time, reducing length.
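
The hierarchical comparison described above amounts to a lexicographic order on the fitness vector. A minimal sketch, assuming the three components are already computed (how deletions and additions are counted is left out), is the following; the class and function names are ours.

```python
from typing import NamedTuple


class Fitness(NamedTuple):
    """Fitness vector; all three components are minimized, in this order."""

    deletions: int
    additions: int
    length: int

    def is_correct(self) -> bool:
        """A stylesheet is correct only with no deletions and no additions."""
        return self.deletions == 0 and self.additions == 0


def better(a: Fitness, b: Fitness) -> bool:
    """True if a is strictly better than b in the hierarchical order."""
    return a < b  # tuples compare field by field, left to right


population = [Fitness(1, 0, 10), Fitness(0, 9, 900), Fitness(0, 9, 40)]
best = min(population)
print(best, best.is_correct())
```

Since Python compares tuples field by field, min and sorted apply the three rules directly, with no aggregation weights to tune.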



4 Experiments and results 

To test the algorithm we have performed several experiments with different XML
input files and a single XML output file. The algorithm has been executed thirty
times for each input XML. Seven different input files have been used for Type 1,
leaving only the hardest ones for Type 2. Some input files were used for several
experiments: an RSS feed from a weblog (http://geneura.wordpress.com)
and an XHTML file. All input and output files are available from our Subversion
repository: https://forja.rediris.es/websvn/wsvn/geneura/GeneradorXSLT/xml/



Table 1: Operator priorities used in the experiments (they feed the roulette
wheel that randomly selects the operator to apply).

Operator                             Priority
XSLTTreeMutatorXPathSetSelf          0.10
XSLTTreeMutatorXPathSetDescendant    0.24 (only Type 1)
XSLTTreeMutatorXPathRemoveBranch     0.27 (Type 2), 0.39 (Type 1)
XSLTTreeMutatorXPathAddBranch        0.99
XSLTTreeMutatorXPathAddFilter        0.45 (Type 2), 0.53 (Type 1)
XSLTTreeMutatorXPathMutateFilter     0.64 (Type 2), 0.69 (Type 1)
XSLTTreeMutatorXPathRemoveFilter     0.83
XSLTTreeCrossoverTemplate            0.11
XSLTTreeMutatorAddTemplate           0.2
XSLTTreeMutatorMutateTemplate        0.10
XSLTTreeMutatorRemoveTemplate        0.12
XSLTTreeAddApply                     0.1
XSLTTreeMutateApply1                 0.1
XSLTTreeMutateApply2                 0.14
XSLTTreeRemoveApply                  0.1
XSLTTreeSetTemplateNull              0.03
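
A roulette wheel over operator priorities can be sketched as follows. How the priorities map to selection probabilities is not detailed here, so the sketch simply assumes they act as relative weights; only a few operators are included, with the values listed in the table for Type 2, and the function name is ours.

```python
import random

# Subset of the operator priorities (Type 2 values where two are listed).
PRIORITIES_TYPE2 = {
    "XSLTTreeMutatorXPathAddBranch": 0.99,
    "XSLTTreeMutatorXPathRemoveBranch": 0.27,
    "XSLTTreeCrossoverTemplate": 0.11,
    "XSLTTreeSetTemplateNull": 0.03,
}


def pick_operator(priorities: dict[str, float], rng: random.Random) -> str:
    """Pick one operator name with probability proportional to its priority."""
    names = list(priorities)
    weights = [priorities[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so that runs are repeatable
print([pick_operator(PRIORITIES_TYPE2, rng) for _ in range(5)])
```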
The computer used to perform the experiments was a Centrino Core Duo
at 1.83 GHz with 2 GB RAM, running the Java Runtime Environment 1.6.0.01. The
population size was 128 for all runs; the termination condition was set to 200
generations or until a solution was found; and selection was performed via a
5-tournament. Thirty experiments were run, with different random seeds, for each
template type and input document. The XML and XSLT processors were the
default ones included in the JRE standard library. The operator rates used in
the experiments, which were tuned heuristically, are shown in table 1.
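
For reference, tournament selection of size 5 over a population of 128 can be sketched as below. This is the textbook form of the operator, assumed here for illustration and not taken from the JEO code; individuals are plain integers and the fitness function is a stand-in that returns a (deletions, additions, length) tuple to be minimized.

```python
import random


def tournament(population, fitness, k, rng):
    """Draw k distinct individuals at random and return the best (smallest) one."""
    contenders = rng.sample(population, k)
    return min(contenders, key=fitness)


def toy_fitness(individual: int) -> tuple[int, int, int]:
    # Stand-in fitness vector, compared lexicographically.
    return (individual % 7, 0, individual)


population = list(range(128))  # population size used in the experiments
rng = random.Random(42)
parent = tournament(population, toy_fitness, 5, rng)
print(parent, toy_fitness(parent))
```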

The new fitness function, in general, yielded better results than the previous one.
The algorithm was able to find an adequate XSLT stylesheet within the
pre-assigned number of generations in most cases. The breakdown of results per
input file is shown in table 2.

Table 2: Number of times, out of 30 experiments, a solution is not found within
the predefined number of generations using the Type 1 XSLT structure. In general,
the files are in increasing complexity order, which is why it gets harder to find a
solution in the last examples.

Input file    Times solution not found
1             0
2             1
3             0
4             0
5             3
6             27
7             17

When a solution was found, the number of generations and the time needed to find
it also vary, as shown in figure 6. In general, the exploration/exploitation
balance seems to be biased towards exploration. The search space is so vast and
rough that, after a few initial generations that create stylesheets
with a small difference from the target, mutations are the main operator at
work, as shown in figure 7.

This last figure also shows a feature of this type of evolution: every change
has a big influence on fitness, since the introduction of a single statement can add
several (even dozens of) lines to the output. There is no linear relation between
the number of mutations needed to reach a solution and the number of
insertions/deletions, which also means that a single mutation might have a big
influence on fitness, while several mutations might be needed to decrease fitness
by a single line.

Some additional experiments have been made using the Type 2 structure; in
general, problems which are difficult to attack using Type 1 are not so difficult
using Type 2. The same number of experiments (30) has been run for every
input/output file combination, but only input files #5, #6 and #7 have been
used. Results are shown in figure 8. Once again, file #6 presents the highest
difficulty, but using this structure raises the number of successful experiments
to 26 (out of 30); it is always able to find the solution for the other two input
[Plot: Computational effort]

Figure 6: Logarithmic boxplot of the number of evaluations needed to find the
correct stylesheet using the Type 1 structure. The difference between easy files
(the first ones) and difficult files (the last ones) is quite clear: while just a few
hundred evaluations, or at most a few thousand, are needed for files number 1 to 4,
several thousand, on average, are needed for numbers 5 and 6. Only runs in which
a solution was actually found have been considered to compute averages.
[Plot: Insertion/Deletion averages; horizontal axis: Generation]

Figure 7: Evolution of the average number of insertions (black, line on top)
and deletions (red or light gray) for a run of file #6 which was able to find a
solution in around 70 generations. The number of deletions decreases in the first
few generations, but, after that, it proceeds more or less randomly, exploring
the search space until the solution is found; the number of insertions, however,
decreases a bit after deletions' dip and then increases slowly.
[Plot: Computational effort, Type 2 stylesheet]

Figure 8: Boxplot of the number of individuals generated to find the optimum
for the Type 2 structure. File #6 presents the maximum difficulty, needing on
average around 2000 individuals. Please note that, even though the solution is
found more often than with the Type 1 structure, the number of evaluations
needed is smaller.


files. 

In general, the structure which we have come to call Type 2 beats the
first one (Type 1) in success rate, number of generations/evaluations needed to
achieve it, and running time. The only advantage of Type 1 over Type 2 is that
it has fewer constraints, and, in some cases, might obtain better results; so, in
general, our advice would be to try Type 2 first and, if it does not yield a good
result, to try Type 1 as well.



5 Conclusions, discussion, and future work 

In this paper we present the results of an evolutionary algorithm designed to
search for the XSLT logicsheet that performs a particular transformation
from one XML document to another. One of the advantages of this application
is that the resulting logicsheets can be used directly in a production environment,
without the intervention of a human operator; in addition, it tackles a real-world
problem found in many organizations. It is also open source software, available
from the Subversion repository
https://forja.rediris.es/websvn/wsvn/geneura/GeneradorXSLT/xml



In these initial experiments we have found which kind of XSLT template
structure is the most adequate for evolution, namely, one that matches the select
attribute in apply-templates with the match attribute in templates, and has an
indeterminate number of value-of instructions within each template; that is the
one called Type 2. This result is consistent with those found in our previous
paper [19]. By constraining evolution this way, we restrict the search space to
a more reasonable size, and avoid the high degree of degeneracy of the problem,
with many different structures yielding the same result, that, if combined,
would result in invalid structures. In general, we have also shown that an XSLT
logicsheet can be found just from an input/output pair of XML documents for
a wide range of examples, some of them particularly difficult.

The experiments have shown that the search space is particularly rough, with 
mutations in general leading to huge changes in fitness. The hierarchical fitness
used is probably the cause of a big loss of diversity at the beginning of
the evolutionary search, leading to the need for a higher level of exploration later
during the algorithm run. This problem will have to be approached via explicit
diversity-preservation mechanisms, or by using a multiobjective evolutionary 
algorithm, instead of the one used now. A deeper understanding of how different 
operator rates affect the result will also help; for the time being, operator rate 
tuning has been very shallow, and geared towards obtaining the result. As such, 
running times and number of evaluations obtained in this paper can be used as 
a baseline for future versions of the algorithm, or other algorithms for the same 
problem. 

However, there are some questions and issues that will have to be addressed 
in future papers: 

• Using the DTD (associated with an XML file) as a source of information for
conversions between XML documents and for restrictions on the possible
variations.






• Adding different labels in the XSLT to allow the building of different kinds 
of documents such as HTML or WML. 

• Considering the use of advanced XML document comparison tools (e.g.
XMLdiff).

• Testing evolution with other kinds of tools, such as a chain of SAX filters.

• Obviously, testing different kinds of documents and increasingly complex sets
of them, and using several input and output documents at the same time,
to test the generalization capability of the procedure.

• Using the identity transform [17] as another frame for evolution, as an
alternative to the types (which we have called 1 and 2) shown here. The
identity transform puts every element found in the input document in
the output document; elements can then be selectively eliminated via the
addition of single statements.

• Tackling problems that are difficult from the point of view of a human
operator. In general, the XSLT stylesheets found here could have been programmed
by a knowledgeable person in around an hour, but in some cases the
input/output mapping would not be so obvious at first sight. This will mean,
in general, increasing the number of XSLT statements used in the stylesheet, and
also adding new types of operators.